Climate Ocean Modeling on a Beowulf Class System 
Abstract

With the growing power and shrinking cost of personal computers, the
availability of fast ethernet interconnections, and public domain
software packages, it is now possible to combine them to build desktop
parallel computers (named Beowulf or PC clusters) at a fraction of what
it would cost to buy systems of comparable power from supercomputer
companies. This led us to build and assemble our own system,
specifically for climate ocean modeling. In this article, we present our
experience with such a system, discuss its network performance, and
provide some performance comparison data with both HP SPP2000 and Cray
T3E for an ocean model used in present-day oceanographic research.

1 Introduction

Beowulf class systems consisting of clusters of off-the-shelf PCs are
becoming a regular fixture in research and industrial computing.
Traditional supercomputers are refrigerator-size cabinets that contain
thousands of microprocessors. These supercomputers are built with
specialized components and software that can be operated only by expert
technicians and programmers. These machines usually have their own
cooling systems, require large amounts of electricity and typically cost
more than $1,000,000. The Beowulf approach represents a new business
model for acquiring computational capabilities, particularly for small
to medium sized applications. It complements rather than competes with
the more conventional vendor-centric systems-supplier approach.

At Jet Propulsion Laboratory (JPL), the ocean modeling group recently
decided to build its own Beowulf system, the first one to be built in
house, mainly to run our increasingly complex ocean models. This system
consists of 13 Intel Celeron Pentium II PCs running at 300 MHz,
interconnected by a 100 Mb/s fast ethernet network, with a total price
of about $2(5 K.
The popular Linux was chosen to be the operating system, and being
publicly available, enabled much of the needed supporting software to be
downloaded from the internet free of charge. The communicating
programming model of choice was the message passing interface, or MPI,
due to its portability. The ocean model to be tested will be based on
the Parallel Ocean Program (POP), which has been extensively used at JPL
on the Cray T3D/E and HP SPP2000 parallel computers. It is anticipated
that we can perform small to medium-size ocean modeling applications on
such a Beowulf cluster, which is complemented by the commercial
massively parallel computers for large and extremely large sized
(killer) applications. The objective of this article is to assess the
performance of our network and present some performance comparison data
with both HP SPP2000 and Cray T3E using application runs from our ocean
model programs.

2 Setting up the Beowulf Cluster

Documentation for setting up a Beowulf cluster is widely available from
various sources, and our setup mainly followed that in "How to Build a
Beowulf: A Tutorial" [1] by P. Angelino et al. All machine parts were
ordered from local commercial vendors and brought in to JPL for
assembly. Most of the parts ordered were identical to those specified in
[1], with the exception of ours having a faster Pentium chip (PII
Celeron), more RAM (I68mb), larger disks (4.3 GB), and different network
cards (3Com 905B-TX Boomerang) and switch (24 port SuperStack II 3300).
We also had the benefit of a newer version of the Linux operating system
(Redhat 5.0), which was purchased with the Extreme Linux CDROM from
Redhat Systems for less than $30. Installation of the operating system
on the main disk went as planned, and the machine booted up as expected.
The first major problem encountered
was the discovery that the 3C905B network card drivers were not
available on the CDROM distribution. A quick search through the internet
led to the site of Donald Becker at Goddard Space Flight Center, who had
recently made available his latest version (v0.99H) of the network
driver, which does support this new card. It was then a routine matter
to compile this new driver and to install it as a module in the kernel.
A series of tweaks followed in order to maximize the network
performance, and we found that in order for the card to detect a
100 Mb/s network link, the following must be modified in the 3c59x
driver code:

    { "100baseTX", Media_Lnk, 0x02, XCVR_100baseFx, (15*HZ)/10},


where 14 was changed to 15, which allowed the card a little more time to
detect the default link beat. The next major problem was how to
clone the first working system to the 12 others, without having to
remove the hard drive and do a dd each time, as done in [1]. A few hours
of research (again through the internet) provided us with the idea of
using Trinux (a diskless version of Linux, residing completely in RAM)
as a starting point for the cloning procedure. The basic step of this
process starts with installation of Trinux on the new machine from a
floppy drive, which included the drivers for the SCSI disk and network
card. After the partitions were set up properly on the new machine, the
network driver was loaded, and a TCP connection was started with the
machine to be cloned. The entire filesystem was then copied with the
cpio command. This method is definitely faster than using dd with large
disks, as cpio is quicker, and has the major advantage of not having to
disconnect the hard drives. With all 13 systems up, connected, and
running, the last major task was to set up the automount file sharing
system. This proved to be routine, as the documentation provided with
the amd software was sufficient.

The Extreme Linux CDROM comes with both LAM and MPICH versions of the
MPI parallel programming model, and we decided to go with the MPICH
version, due to our previous experience with this software. We
downloaded a later version (1.2) of MPICH from the Argonne National Labs
website, and installed the program. An f90 compiler was also needed for
compiling our ocean parallel programs, and the Absoft f90 compiler was
our choice.

3 Network Performance 

We decided to first examine the network performance of this machine,
using tests similar to those done in [2]. The network interface cards
(NICs) of the Hyglac machine (JPL's first Beowulf, built at Caltech) are
D-Link cards with the Tulip chipset, while, as mentioned before, ours
are 3Com 905B-TX Boomerangs. Comparing the

Figure 1: Sockets vs. MPI throughput performance


results with those obtained in [2] with the JPL Hyglac Beowulf cluster
(Figure 1), the netperf program for BSD socket tests showed that our
network has a higher throughput for packet sizes of less than 512 bytes,
reaching close to its peak rate by the time the packet size hit 128
bytes. The peak rate of 9.87 Mbytes/s, however, was significantly less
than the 11.8 Mbytes/s observed with the Hyglac network. For
MPI send and receive performance, ours shows slightly better rates for
packet sizes up to 1 kbyte, and at 64 kbyte and 128 kbyte packets.
There is a rather precipitous drop in performance between 8 kbyte and
32 kbyte packet sizes, after which the rate moves back up to a peak of
7.51 Mbytes/s for 256 kbyte packets. The rate drop occurred at a
different location for the Hyglac network (between 32 kbytes and
128 kbytes), with a maximum recorded rate of 8.3 Mbytes/s. Possible
causes of this rate drop are socket buffer size and ethernet segment
size [2], though the exact cause has yet to be determined. However, it
is clear at this point that it is independent of the type of NIC used.
In summary, our machine will perform better than Hyglac on programs
communicating with packets less than 1 kbyte in size.
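
The kind of throughput-versus-packet-size sweep shown in Figure 1 can be
sketched in a few lines of Python. This is only an illustration and not
the netperf or MPI test used here: it streams fixed-size messages
through a local socket pair, so no ethernet hardware is involved, and
the names stream_throughput and _drain, the message counts, and the
timeout are hypothetical choices made for the sketch.

```python
import socket
import threading
import time


def _drain(sock, total):
    """Receive exactly `total` bytes, then send a one-byte acknowledgement."""
    remaining = total
    while remaining > 0:
        chunk = sock.recv(min(remaining, 65536))
        if not chunk:
            raise ConnectionError("peer closed before all data arrived")
        remaining -= len(chunk)
    sock.sendall(b"k")


def stream_throughput(msg_size, n_msgs=200):
    """Return the rate in Mbytes/s for n_msgs messages of msg_size bytes."""
    sender, receiver = socket.socketpair()
    sender.settimeout(30.0)      # assumption: generous guard against hangs
    receiver.settimeout(30.0)
    total = msg_size * n_msgs
    worker = threading.Thread(target=_drain, args=(receiver, total))
    worker.start()
    payload = b"x" * msg_size
    start = time.perf_counter()
    try:
        for _ in range(n_msgs):
            sender.sendall(payload)
        sender.recv(1)           # wait until the receiver consumed everything
        elapsed = time.perf_counter() - start
    finally:
        worker.join()
        sender.close()
        receiver.close()
    return total / elapsed / 1.0e6


if __name__ == "__main__":
    # Sweep message sizes, as in a sockets throughput plot.
    for size in (64, 128, 512, 1024, 8192, 65536):
        print(f"{size:6d} bytes: {stream_throughput(size):8.2f} Mbytes/s")
```

On a real cluster the two endpoints would sit on different nodes, and
the small-message end of the curve would be dominated by per-packet
latency rather than by the link bandwidth.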


4 Description of the Ocean 
Model 

The Ocean General Circulation Model (OGCM) is based on the Parallel
Ocean Program (POP) developed at Los Alamos National Laboratory [3].
This ocean model evolved from the Bryan-Cox 3-dimensional primitive
equations ocean model [4,5], developed at NOAA Geophysical Fluid
Dynamics Laboratory (GFDL), and later known as the Semtner and Chervin
model or the Modular Ocean Model (MOM) [6]. Currently, there are
hundreds of users within the so-called Bryan-Cox ocean model family,
making it the dominant OGCM code in the climate research community.
Furthermore, this model has been subjected to a high degree of
optimization on parallel machines over the last few years [7, 8].

The OGCM solves the 3-dimensional primitive equations with the finite
difference technique. The equations are separated into barotropic (the
vertical mean) and baroclinic (departures from the vertical mean)
components. The baroclinic component is 3-dimensional, and uses explicit
leapfrog time stepping. It parallelizes very well on massively parallel
computers. The barotropic component is 2-dimensional, and is solved
implicitly. It differs from the original Bryan-Cox formulation in that
it removes the rigid-lid approximation and treats the sea surface height
as a prognostic variable (i.e., free-surface). The free-surface model is
superior to the rigid-lid model because it provides more accurate
solution to
code, which comprises much of the computation and communication in terms
of the model, on different parallel machines. Each of the machines
compared runs its own implementation of the MPI message passing
architecture. The model grid size chosen for testing is a 2 degree x 1
degree global ocean model with 180 x 180 horizontal grid points and 20
vertical levels, which is the largest size that can fit in our Cray T3E
memory for a two PE run. The POP 2-dimensional solver code uses a 9
point stencil scheme with diagonal preconditioning. The pperf package,
which takes advantage of the special Model Specific Register (MSR) of
the Pentium processor, is used to obtain accurate time and floating
point operations (FLOP) count for each iteration of the solver. It
usually takes several iterations for the solver to complete one timestep
of a model run, i.e. for the solution to converge. Performance speed is
defined as

\[
  \mathrm{Speed} = \frac{\text{FLOP per timestep}}{\text{execution time per timestep}}
\]

and averaging this over the total number of timesteps. To examine the
differences in the flop rate, we also looked at the ratio of computation
to communication for the current problem grid size, shown below in
column 3. Single node performance results for the solver running on two
processors are as follows:
speed of 600 Mflop/s. The Alphas are connected by a high bandwidth,
low-latency bidirectional 3D torus system interconnect network. The best
optimization flags available are applied to the compiler for each of the
above machines. Because of the communication overheads, the net flop
rate given above is lower than the actual flop rate in accordance with
the amount of time spent doing the communications.
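
The relation between the net and the actual flop rate can be made
concrete with a small calculation. The sketch below assumes that the
time per timestep splits cleanly into a computation part and a
communication part with no overlap between them; the function names and
the sample numbers are hypothetical and are not measurements from any of
the three machines.

```python
def actual_flop_rate(flop, t_comp):
    """Flop rate counting only the time spent computing (flop/s)."""
    return flop / t_comp


def net_flop_rate(flop, t_comp, t_comm):
    """Flop rate over the whole timestep, communication included (flop/s)."""
    return flop / (t_comp + t_comm)


def net_from_ratio(actual_rate, comp_to_comm):
    """Same net rate, written with the computation/communication ratio."""
    return actual_rate * comp_to_comm / (comp_to_comm + 1.0)


def mean_speed(flops, times):
    """Average of the per-timestep speeds (FLOP divided by execution time)."""
    speeds = [f / t for f, t in zip(flops, times)]
    return sum(speeds) / len(speeds)


if __name__ == "__main__":
    # Made-up values: FLOP per timestep, seconds computing, seconds communicating.
    flop, t_comp, t_comm = 4.0e7, 1.0, 0.25
    actual = actual_flop_rate(flop, t_comp)
    net = net_flop_rate(flop, t_comp, t_comm)
    print("actual Mflop/s:", actual / 1e6)
    print("net Mflop/s:   ", net / 1e6)
    print("via ratio:     ", net_from_ratio(actual, t_comp / t_comm) / 1e6)
```

Under this assumption a machine with a computation to communication
ratio of r delivers the fraction r / (r + 1) of its actual flop rate, so
two machines with the same ratio lose the same share to communication.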

Given that the computation to communication ratio is about the same for
the Beowulf and the T3E, the Beowulf flop rate is close to what was
expected with its peak flop rate of about 300 Mflop/s, half that of the
T3E. On the other hand, it is notable that on the Exemplar, the model
spends more of its time on communication relative to the other two
machines, with the faster performance due to the Exemplar's coherent
memory caches (at least within a hypernode). This explains why the
Exemplar flop rate is so much higher than expected from its peak flop
rate, which is only about 2.4 times faster than the Beowulf peak flop
rate. We also looked at the solver speedup measurements for the above
grid size, shown in Figure 2, using the above two PE run as the
baseline.
speedup is comparable to that of the T3E. The Exemplar exhibits an
interesting superlinear speedup with the number of processors, which we
attribute to the intrahypernode memory caching in communicating data.
Overall, we see that the Beowulf cluster performs favorably in
comparison, and costs about 10 times less per node than each of the
other two machines.
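
Speedup relative to a two PE baseline, and the test for superlinear
behavior, can be written down as follows. This is a minimal sketch: the
function names are hypothetical, and the timings in the example are
invented for illustration and are not taken from Figure 2.

```python
def speedup_vs_baseline(times, base_pes=2):
    """Map PE count -> speedup relative to the run on `base_pes` PEs."""
    t_base = times[base_pes]
    return {p: t_base / t for p, t in sorted(times.items())}


def is_superlinear(times, base_pes=2):
    """Map PE count -> True where speedup exceeds the ideal p / base_pes."""
    s = speedup_vs_baseline(times, base_pes)
    return {p: s[p] > p / base_pes for p in s}


if __name__ == "__main__":
    # Hypothetical solver times in seconds per timestep, keyed by PE count.
    sample = {2: 8.0, 4: 4.4, 8: 1.8}
    print(speedup_vs_baseline(sample))
    print(is_superlinear(sample))
```

With a two PE baseline the ideal speedup on p processors is p / 2, so a
run is superlinear only when its measured speedup exceeds that value.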

Finally, to convince ourselves and others that an actual model run is
feasible, we set up an experiment with the POP model, using realistic
topography and forcing the model with real ocean wind (from the European
Centre for Medium-Range Weather Forecasts (ECMWF)), salinity, and
temperature (Levitus) data. The model domain ranges from 100E to 130E
and 0 to 30N (closed wall on all four sides), with a resolution of 1/3
degree x 1/3 degree and 20 vertical levels. The sea level output at the
end of a 120-day run is shown in Figure 3.
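
The size of this regional grid follows directly from the domain and the
resolution. The short calculation below is a sketch that assumes one
grid cell per 1/3 degree in each horizontal direction and ignores any
boundary or ghost points the model may add; the helper name cells is
hypothetical.

```python
from fractions import Fraction


def cells(start_deg, end_deg, resolution_deg):
    """Number of grid cells spanning [start, end] at a uniform resolution."""
    span = Fraction(end_deg) - Fraction(start_deg)
    return int(span / Fraction(resolution_deg))


if __name__ == "__main__":
    res = Fraction(1, 3)
    nx = cells(100, 130, res)   # longitude, 100E to 130E
    ny = cells(0, 30, res)      # latitude, 0 to 30N
    nz = 20                     # vertical levels
    print(nx, ny, nz, nx * ny * nz)
```

Exact fractions are used instead of floating point so that dividing by
1/3 cannot round the cell count down by one.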



Figure 3: Sea level output at end of 120-day model run, with color scale
ranging from purple (lowest) to pink (highest).


6 Conclusions 

Beowulf class PC clusters are well suited for ocean modeling
applications, especially for small to medium sized problems. With the
current trends in PC pricing and CPU performance, the Beowulf computing
paradigm seems destined only to grow in suitability. The attractive
price-to-performance ratio means such machines are likely to be around
for research and many other non time-critical applications. Another
major advantage in favor of such a cluster is the ability to use it as a
dedicated machine without sharing computing resources with many users,
as is currently the case with large expensive machines. Our PC cluster
is definitely not the best in quality that can be assembled, but
certainly qualifies as one of the least expensive (in the Los Angeles
area), if not the least expensive one in terms of performance. It is not
difficult to imagine a passionate ocean modeler having a hard time
deciding whether to put a Honda or a Beowulf in his/her own garage.